Abstract

Studies on friendships in online social networks involving geographic distance have so far relied 
on the city location provided in users’ profiles. Consequently, most research on friendships
has provided accuracy at the city level, at best, when designating a user’s location. This study analyzes
a Twitter dataset because it provides the exact geographic distance between corresponding users. 
We start by introducing a strong definition of “friend” on Twitter (i.e., a definition of bidirectional
friendship), requiring bidirectional communication. Next, we utilize geo-tagged mentions delivered
by users to determine their locations, where “@username” is contained anywhere in the body of 
tweets. To provide analysis results, we first introduce a friend counting algorithm. From the fact 
that Twitter users are likely to post consecutive tweets in the static mode, we also introduce a 
two-stage distance estimation algorithm. As the first of our main contributions, we verify that the 
number of friends of a particular Twitter user follows a well-known power-law distribution (i.e., a 
Zipf’s distribution or a Pareto distribution). Our study also provides the following newly-discovered 
friendship degree related to the issue of space: The number of friends according to distance follows a 
double power-law (i.e., a double Pareto law) distribution, indicating that the probability of befriending 
a particular Twitter user is significantly reduced beyond a certain geographic distance between users, 
termed the separation point. Our analysis provides concrete evidence that Twitter can be a useful 
platform for assigning a more accurate scalar value to the degree of friendship between two users. 
Befriend, bidirectional friendship, complex network, double power-law, geo-tagged mention,

I. Introduction

In recent years, research in the field of online social networks (OSNs) has grown
dramatically with the evolution of technologies while harnessing Big Data. Focusing on the
relationships (edges) among users or profiles (vertices), OSN analysis has emerged as one
of the most popular and familiar approaches for examining interaction, information sharing,
and collaboration among online users. Simultaneously, the field of complex networks
has emerged as an independent research area, with strong connections to random graph
theory from mathematics as well as to social network analysis by physicists interested in
understanding the behaviors of large-scale interacting networks. Based on massive datasets of
large-scale real-world OSNs such as Twitter, Facebook, Flickr, and Foursquare,
extensive studies have validated that the small-world phenomenon (originally introduced
by Watts and Strogatz) and scale-free degree distribution,¹ which are the two most
representative features of complex networks, nearly hold in OSNs. Twitter is one of the
most popular micro-blogs (or social media), allowing users to “tweet” about any topic within
the 140-character limit and to “follow” others to receive their tweets. At the start of 2015,
Twitter played a vital role in facilitating social contacts, boasting 284 million active users per
month who published 500 million tweets daily from their web browsers and smart phones.²

A. Related Work 

To understand the nature of friendships online with respect to geographic distance, some
efforts have focused on users’ online profiles that include their city of residence.
In one such study, experimental results based on the LiveJournal social network³ demonstrated a close
relationship between geographic distance and the probability distribution of friendship, where the
probability of befriending a particular user on LiveJournal is inversely proportional to a
positive power of the number of closer users. In contrast, based on data collected
from Tuenti,⁴ a Spanish social networking service, another study found that social interactions
online are only weakly affected by spatial proximity, with other factors dominating.

However, the effect of distance on online social interactions has not yet been fully
understood. In the previous studies, the geographic location points only to the location of users
at a city scale. For this reason, the friendship degree distribution contains a background
probability that is independent of geography due to the city-scale resolution. On the
other hand, geo-located Twitter can provide high-precision location information down to 10
meters through the Global Positioning System (GPS) interface [10] of users’ smart phones
while offering comprehensive metadata with a gigantic sample of the whole population.

For this reason, there is extensive and growing interest among researchers in understanding a
variety of social behaviors through geo-located Twitter, or, equivalently, geo-tagged tweets [11].
Even though geo-tagged tweets account for approximately 1% of the total amount [20], thanks
to the increasing penetration of smart devices and mobile applications, the volume of
geo-located Twitter has grown constantly and now forms an invaluable register for understanding
human behavior and modelling the way people interact in space. In [11], along with
geo-locations for collected tweets, the analysis included how geo-related factors such as physical
distance, frequency of air travel, national boundaries, and language differences affect the
formation of social ties on Twitter. In [12], it was found that the geo-locations of Twitter users across

1 A “small-world” network is a type of mathematical graph in which two arbitrary nodes (people) are connected by a short
chain of intermediate links (friends), and a “scale-free” network is a network whose degree distribution follows a power-law.

2 https://about.twitter.com/company

3 https://www.livejournal.com

4 https://www.tuenti.com 
different countries considerably impact their participation in Twitter, their connectivity with
other users, and the information that they exchange with each other. As another application,
the use of geo-tagged tweets was evaluated as a complementary source of information for
urban planning, including i) a technique to determine land uses in a specific urban area based
on tweeting patterns and ii) a technique to identify urban points of interest at places with high
tweeting activity [13]. New approaches based on geo-tagged tweets were also proposed to find
top vacation spots for a particular holiday by applying indexing, spatio-temporal querying,
and machine learning techniques [14] and to detect unusual geo-social events by measuring
geographical regularities of crowd behaviors [15].

Benefiting from the increasing availability of location information from geo-tagged tweets,
there has been a steady push to understand individual human mobility [16]-[19], which is
of fundamental importance for many applications, from human and electronic virus prediction
to traffic and population forecasting. Recent efforts have focused on studies of human
mobility using tracking technologies such as mobile phones [21]-[24], GPS receivers [25],
WiFi logging [26], Bluetooth [27], and RFID devices [28] as well as location-based social
network check-in data [29], but these technologies involve privacy concerns or data access
restrictions. In contrast, geo-tagged tweets can capture much richer features of human mobility.
For example, in [16], global human mobility patterns were revealed, and a comparative
study on the mobility characteristics of different countries was conducted. Furthermore, it was
found in [17] that the geo-located Twitter data for Australia reveals multiple modes of human
mobility from intra-site to metropolitan and inter-city movements. From another point of view,
in [18], it was reported that in Australia, the gravity law is applicable for estimating human
mobility by showing that mobility between an origin and its destination is proportional to
the product of the populations of these two places and is inversely proportional to a power
of the distance between them. In [19], the problem of labelling the places of a city based on
collected spatio-temporal data was addressed, including i) inferring whether a place belongs to
a certain category or not and ii) choosing the category of a place among a set of categories.

B. Main Contributions 

In our work, we utilize geo-tagged mentions on Twitter, sent by users, to identify their
exact location information. A “mention” on Twitter consists of the inclusion of “@username”
anywhere in the body of a tweet. Since people tend to interact offline with those
living very near to them, a natural question is whether geography and
social relationships are also inextricably intertwined on Twitter. Our research differs significantly
from a variety of studies on human mobility in the literature [16]-[19], [21]-[29] since it is
interested in how a pair of users interacts. To the best of our knowledge, such an attempt
to analyze one-to-one friendship based on geo-located tweets (or mentions) has not yet been
described in the literature.

As people normally spend a substantial amount of time online, data regarding these two
dimensions (i.e., geography and online social relationships) are becoming increasingly precise,
thus motivating us to build more reliable models to describe social interactions [30]. Previous
studies have employed large amounts of data from diverse sources, such as smart devices
and web-based applications, to examine how social data resources (e.g., photos on Flickr) are
processed with tagging [31], [32]. Both a co-clustering approach [31] and a spatial ranking
approach [32] have been introduced to discover meaningful relationships between a set of
relevant resources and a set of tags. This paper goes beyond past research to determine how
friendship patterns are geographically represented by Twitter, analyzing a single-source dataset
(to avoid potential confounds) that contains a huge number of geo-tagged mentions from users
in i) the state of California in the United States (US) and Los Angeles (the most populous
city in the state) and ii) the United Kingdom (UK) and London (the most populous city in the 
UK). These two location sets were selected as demographically comparable, yet distinct and 
geographically separated, leading adopters of Twitter with sufficient data to enable meaningful 
comparative analysis for our intentionally exploratory study (which will be specified in Section 
II). In this dataset, each mention record has a geo-tag (spatial information) and a timestamp 
(temporal information) indicating from where, when, and by whom the mention was sent. We 
propose and apply the following new framework, which establishes a more accurate friendship 
degree on Twitter, and a method to enable analysis based on geographic distance: 

• To fully take into account the intensity of communication between users, we start our 
analysis by introducing a rather strong definition of “friend” on Twitter, i.e., a definition of
bidirectional friendship, instead of naively considering the set of followers and followees 
(unidirectional terms). This definition requires bidirectional communication within a 
designated time frame to constitute a friendship. 

• Using the above definition, we introduce a friend counting algorithm, which computes 
the distribution of the number of friends for each Twitter user. 

• By showing that almost all Twitter users are likely to post consecutive tweets in the static 
mode, we propose a two-stage distance estimation method, where the geographic distance 
between two befriended users (denoted by Users u and v) based on our definition of 
bidirectional friendship is estimated by sequentially measuring the two senders’ locations. 
More specifically, the location of User u is recorded at the moment when User u sends 
a mention to User v, while the location of User v can also be recorded when User v 
sends a replied mention to User u at the next closest time, enabling estimation of the 
distance between Users u and v. 

Note that the above definition is suitable for evaluating one-to-one bidirectional social
interactions on Twitter since Twitter users tend to personally interact with only a few of their
followers/followees by sending and receiving direct mentions. Based on our new framework, we
comprehensively analyze how the geographic distance between Twitter users affects their
interaction. Our main contributions are as follows:

• Based on the definition of bidirectional friendship, we first verify that the number of
friends of one user on Twitter follows a power-law distribution (i.e., a Zipf’s
distribution [33] or a Pareto distribution [34]), which is known to be
asymptotically equivalent to the degree distribution of scale-free networks. This finding
is consistent with the earlier results in other OSNs.

• Next, more interestingly, we characterize a newly-discovered probability distribution
of the number of friends according to geographic distance, which does not follow a
homogeneous power-law but, instead, a double power-law (i.e., a double Pareto law [35]).
From this new finding, we identify not only two fundamentally separate regimes, termed
the intra-city and inter-city regimes, which are characterized by two different power-laws
in the distribution, but also the separation point between these regimes.

C. Organization 

The rest of this paper is organized as follows. Section II describes the dataset, and Section 
III explains our analysis methodology. In Section IV, experimental results are presented by 
analyzing the number of friends of a particular user and the number of friends with respect 
to distance. Finally, we summarize the paper with some concluding remarks in Section V. 
II. Dataset

We use a dataset collected by crawling the Twitter network via the Twitter Streaming
Application Programming Interface (API),⁵ which returns tweets matching a query provided by the
Streaming API user. Although the Twitter Streaming API returns at most a 1% sample
of all the tweets produced at a given moment, it constitutes a valid representation of users’
activity on Twitter when more specific parameter sets such as different users, geographic
bounding boxes, and keywords are created (thereby enabling extraction of more data from
the Streaming API) [20], [36]. It was found that the Streaming API returns an almost complete
set of geo-tagged tweets despite sampling [20]. Thus, this research can be regarded as
working with an almost complete sample of geo-located Twitter data.

In our work, we examined data from all possible devices (sources) that indicate the user’s 
location information at the time that they access Twitter. The statistics based on our dataset 
demonstrate that a large majority of the Twitter users in our sample posted geo-tagged tweets 
through smart phones rather than web browsers on a desktop or laptop computer.⁶ This reveals
that our dataset is much more inclined toward geo-tagged tweets (more rigorously, geo-tagged 
mentions) transmitted through the GPS interface. 

The dataset consists of a large number of geo-tagged mentions recorded from Twitter users
from September 22, 2014 to October 23, 2014 (about one month) in the following four large
regions: California, Los Angeles, the UK, and London. Note that this short-term (one month)
dataset is sufficient to examine how closely one user has recently interacted with another
online (i.e., a personal online relationship between two users). The four regions in our dataset
were selected since they are quite comparable at both the macro (state or country) and micro
(city) scales in terms of i) area, ii) population density, and iii) Twitter popularity (e.g., the
number of Twitter accounts or the number of posted tweets). The comparison between location
sets for the aforementioned three representative attributes is summarized in TABLE I, divided
according to the two geographic scales.⁷

The representative statistics of the collected dataset, such as the total number of mentions 
and the total number of senders, are also summarized by regional group in TABLE II. In this 
dataset, each mention record has a geo-tag and a timestamp indicating from where, when, 
and by whom the mention was sent. Based on this information, we are able to construct a 
user’s location history denoted by a sequence $L = (x_{ki}, y_{ki}, t_i)$, where $x_{ki}$ and $y_{ki}$ are the x-
and y-coordinates of User $k$ at time $t_i$, respectively. The location information provided by the
geo-tag is denoted by latitude and longitude, which are measured in degrees, minutes, and 
seconds. 

Each mention on Twitter contains a number of entities that are distinguished by their 
attributed fields. For data analysis, we adopted the following five essential fields from the 

5 https://dev.twitter.com/decs/streaming-apis 

6 We note that smart devices and mobile applications enable us to provide high-precision location information through 
the built-in GPS interface. On the other hand, with the Geo-location API, web browsers can detect the users’ approximate 
location information inferred from network signals such as IP address, WiFi, Bluetooth, MAC address, and GSM/CDMA 
cell ID, which are not guaranteed to return the users’ actual location. Based on our dataset, it is found that 77.84% and 
82.21% of Twitter users tend to post geo-tagged tweets in California and the UK, respectively, via iPhone and Android 
Phone, which are the smart phone types using the two most popular mobile platforms among all devices. It is also found 
that 90.52% and 81.14% of posted geo-tagged tweets tend to be recorded in California and the UK, respectively, via iPhone 
and Android Phone. 

7 http://en.wikipedia.org/wiki/California
http://en.wikipedia.org/wiki/United_Kingdom 
http://en.wikipedia.org/wiki/Los_Angeles 
http://en.wikipedia.org/wiki/London 

http://semiocast.com/publications/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US 
(a) California versus UK (state scale or country scale)

Attribute                                   California                    UK
Area (km²)                                  423,970                       243,610
Population density (population/km²)         95.0                          225.6
Global ranking among countries              1st (US as whole country)     4th
by the number of Twitter accounts

(b) Los Angeles versus London (city scale)

Attribute                                   Los Angeles                   London
Area (km²)                                  1,302                         1,572
Population density (population/km²)         3,198                         5,354
Global ranking among cities                 8th                           3rd
by the number of posted tweets (June 2012)

TABLE II

Statistics of the dataset: The number of mentions and unique users in each region.

Region         Number of mentions    Number of users (senders)
California     2,349,901             217,439
Los Angeles    918,360               51,625
UK             3,721,716             612,368
London         614,045               58,046

metadata of mentions:⁸

• user_id_str: string representation of the sender ID

• in_reply_to_user_id_str: string representation of the receiver ID

• lat: latitude of the sender

• lon: longitude of the sender

• created_at: UTC/GMT time when the mention is delivered, i.e., the timestamp

Note that the two location fields, lat and lon, correspond to spatial (geo-tagged) information
while the last field, created_at, represents temporal (time-stamped) information.
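As an illustration of how these five fields could be turned into per-user location histories, the following minimal Python sketch may help. It is not the code used in this study: the `Mention` class, the helper names, the assumption that lat and lon arrive as decimal degrees, and the timestamp layout in `CREATED_AT_FORMAT` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed layout of created_at, e.g. "Mon Sep 22 10:15:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


@dataclass(frozen=True)
class Mention:
    sender: str           # user_id_str
    receiver: str         # in_reply_to_user_id_str
    lat: float            # latitude of the sender (assumed decimal degrees)
    lon: float            # longitude of the sender (assumed decimal degrees)
    created_at: datetime  # timestamp of the mention


def parse_mention(record: dict) -> Mention:
    """Build a Mention from a flat dict holding the five fields."""
    return Mention(
        sender=record["user_id_str"],
        receiver=record["in_reply_to_user_id_str"],
        lat=float(record["lat"]),
        lon=float(record["lon"]),
        created_at=datetime.strptime(record["created_at"], CREATED_AT_FORMAT),
    )


def location_history(mentions, user):
    """Return the time-ordered sequence (lat, lon, time) of one sender."""
    own = sorted((m for m in mentions if m.sender == user),
                 key=lambda m: m.created_at)
    return [(m.lat, m.lon, m.created_at) for m in own]
```

Only the sender's position is available in each record, which is why the history is built from the mentions a user sends rather than those the user receives.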

III. Research Methodology 

We start by introducing the following definition of “bidirectional friendship” on Twitter. 

Definition 1 (Bidirectional friendship in Twitter): If two users send/receive direct mentions 
to/from each other (i.e., bidirectional personal communication occurs) within a designated 
amount of time, then they form a bidirectional friendship with each other. 

Note that our definition differs from the conventional definition of “friend” on Twitter,
which is referred to as a followee and thus represents a unidirectional relation [37], [38].⁹

8 https://dev.twitter.com/overview/api/tweets 

9 Twitter shows a low level of reciprocity; 77.9% of user pairs with any link between a Twitter user and his/her follower 
are connected one-way, and only 22.1% exhibit a reciprocal relationship between them (i.e., two-way links).
Fig. 1. One example that illustrates how geo-tagged mentions are delivered from senders to receivers according to time
sequence, where $u_{Tx}^{(t)}$ and $u_{Rx}^{(t)}$ denote the transmitter and the corresponding receiver at time $t \in \{0, 1, \ldots\}$. In this example,
three pairs of friends, $(u_0, u_2)$, $(u_1, u_2)$, and $(u_1, u_3)$, are made among four users $u_0$, $u_1$, $u_2$, and $u_3$.


Since friendship relations in the offline world and on other OSNs such as Facebook [39]
are generally not unidirectional, our intention is to formulate a bidirectional friendship that
is directly applicable to offline relationships. This strong definition enables us to exclude
inactive friends (or passive friends) who have been out of contact online for a long designated
amount of time (e.g., about one month in our work) and to count the number of active friends
who have recently communicated with each other.

A. Counting Number of Friends of a Particular User 

In this subsection, we explain how to count the number of friends of each user who sent
at least one geo-tagged mention. Suppose that there are four Twitter users, denoted by $u_0$,
$u_1$, $u_2$, and $u_3$, who sent or received at least one geo-tagged mention according to temporal
event sequences, as illustrated in Figure 1. Here, $u_{Tx}^{(t)}$ and $u_{Rx}^{(t)}$ denote the transmitter and the
corresponding receiver, respectively, at time instance $t \in \{0, 1, \ldots\}$. In this example, according
to the aforementioned definition, three pairs of friends $(u_0, u_2)$, $(u_1, u_2)$, and $(u_1, u_3)$ are found
out of the above user set. Moreover, one can find that the number of friends of each user
$u_0$, $u_1$, $u_2$, and $u_3$ is given by 1, 2, 2, and 1, respectively. In our framework, if bidirectional
communication between two certain users occurs at least once, then their friendship degree is
set to one. Otherwise, it is set to zero, i.e., no friendship between the two users is created. That
is, even with multiple bidirectional communications between two users, their friendship
degree is maintained at one in this binary or Boolean evaluation. In our sample space, we
exclude the user set whose friendship degree is zero since including such users will lead to
scaling down the probability distribution of the nonzero number of friends.
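The counting rule above can be sketched in a few lines of Python. This is a simplified, set-based illustration and not a line-by-line transcription of the friend counting algorithm: the function name `count_friends` and the event sequence in the usage note are hypothetical, chosen only so that the result matches the four-user example.

```python
from collections import defaultdict


def count_friends(mentions):
    """Count bidirectional friends per user.

    mentions: iterable of (sender, receiver) pairs observed within the
    designated time frame. Two users are friends (degree one) if each
    has sent at least one mention to the other; repeated exchanges do
    not raise the degree above one.
    """
    # Collapse repeated mentions into a set of directed links.
    directed = {(s, r) for s, r in mentions if s != r}

    friends = defaultdict(set)
    for s, r in directed:
        if (r, s) in directed:  # the reverse direction was also observed
            friends[s].add(r)

    # Users without any bidirectional friend are left out, mirroring the
    # exclusion of zero-degree users from the sample space.
    return {user: len(peers) for user, peers in friends.items()}
```

For example, the hypothetical sequence `[("u0","u2"), ("u1","u2"), ("u2","u0"), ("u2","u1"), ("u1","u3"), ("u3","u1"), ("u3","u0")]` yields 1, 2, 2, and 1 friends for u0, u1, u2, and u3; the final one-way mention from u3 to u0 creates no friendship.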

The overall procedure of the friend counting algorithm (Algorithm 1) is described in
TABLE III, where $n_u$ denotes the number of friends of User $u \in \{u_0, u_1, \ldots, u_{I-1}\}$ who
sent a geo-tagged mention to User $v \in \{v_0, v_1, \ldots, v_{J-1}\}$, and $I$ and $J$ are the total number
of senders and receivers in a dataset, respectively.
TABLE III

The overall procedure of the friend counting algorithm.

Algorithm 1 Friend counting algorithm

Input: $u_{Tx}^{(t)}$ and $u_{Rx}^{(t)}$ for $t = 0, 1, \ldots, T-1$, $u \in \{u_0, u_1, \ldots, u_{I-1}\}$,
and $v \in \{v_0, v_1, \ldots, v_{J-1}\}$
Output: $n_u$ for all $u$
Initialization: $c_{uv} \leftarrow 0$ and $n_u \leftarrow 0$ for all $u$ and $v$
00: for $t \leftarrow 0$ to $T-1$ do
01:     Find the user indices $u$ and $v$ for $u_{Tx}^{(t)}$ and $u_{Rx}^{(t)}$, respectively
02:     for $s \leftarrow t+1$ to $T-1$ do
03:         if ($u_{Tx}^{(s)} == u_{Rx}^{(t)}$) then
04:             if ($u_{Rx}^{(s)} == u_{Tx}^{(t)}$) then
05:                 $c_{uv} \leftarrow 1$
06:                 break (go back to line 00)
07:             end if
08:         end if
09:     end for
10: end for
11: for all $u$ and $v$ do
12:     $n_u \leftarrow n_u + c_{uv}$
13: end for


B. Finding Friend Distribution With Respect to Distance 

In this subsection, let us turn to characterizing the friendship degree of individuals regarding
geography by analyzing their sequences $L = (x_{ui}, y_{ui}, t_i)$ of geo-tagged mentions, where only
the senders’ location information is recorded. We propose a two-stage method to estimate the 
geographic distance between Twitter friends. If User u sends a mention to User v, then the 
location information of User u is recorded (the first stage). In order to find the location of 
User v, we need to wait for the moment at which User v sends a mention back to User u (the 
second stage). That is, after bidirectional communication between two Twitter users occurs, 
the location of each user can be identified. 

It is not possible to evaluate the geographic distance between two Twitter users through a
one-shot process because the location information of only the sender is recorded at the
instant when a geo-tagged mention is sent. Moreover, because of the users’ movements,
it is not straightforward to measure the exact distance. We therefore
introduce a two-stage distance estimation method, where the geographic distance between
two befriended users is estimated by sequentially measuring the two senders’ locations.

Before describing the estimation algorithm, let us first focus on the time interval between
the following two events for a befriended pair: a mention and its replied mention at the next
closest time. We count only the events with a time duration between a mention and its replied
mention, or inter-mention interval, of less than one hour to exclude certain inaccurate location
information that may occur due to users’ movements.¹⁰ Figure 2 illustrates the instance for
which User $u$, originally placed at $(x_{u0}, y_{u0}, t_0)$, sent a mention to User $v$ at $(x_{v0}, y_{v0}, t_0)$, and
10 Note that inter-mention interval of one hour may be shortened, but this will lead to a reduction in the available dataset. 
Fig. 2. User movement in which User $k \in \{u, v\}$ changes location from $(x_{k0}, y_{k0}, t_0)$ to $(x_{k1}, y_{k1}, t_1)$ between sending
a geo-tagged mention and receiving a corresponding replied mention.


then received a replied mention at the location $(x_{u1}, y_{u1}, t_1)$ from User $v$ placed at $(x_{v1}, y_{v1}, t_1)$.
Here, the single solid arrows indicate the actual distances at time instances $t_0$ and $t_1$ while the
double solid arrow indicates the estimated distance. The distance that users moved between
the two moments in time $t_0$ and $t_1$ (i.e., the inter-mention interval) is indicated by the dashed
arrows in the figure. From these two consecutive mention events, it is possible to estimate
the geographic distance based on the two sequences $(x_{u0}, y_{u0}, t_0)$ and $(x_{v1}, y_{v1}, t_1)$. In our
framework, by assuming that the Earth is spherical, we deal with the shortest path between
two users’ locations measured along the surface of the Earth, instead of the rather naive
straight-line Euclidean distance. Following an approach similar to that employed in [10],
the distance between two locations on the Earth’s surface can be computed according
to the spherical law of cosines.¹¹ Then, when we denote the distance between the two users
measured from $(x_{u0}, y_{u0}, t_0)$ and $(x_{v1}, y_{v1}, t_1)$ by $d_{uv}^{(0)}$, we obtain¹²

$$d_{uv}^{(0)} = R \cos^{-1}\left(\sin x_{u0} \sin x_{v1} + \cos x_{u0} \cos x_{v1} \cos(y_{v1} - y_{u0})\right), \quad (1)$$

where $R$ [in kilometers (km)] denotes the Earth’s radius and is given as 6,371, and the
superscript 0 in $d_{uv}^{(0)}$ represents the time slot. Here, for notational convenience, it is assumed
that the x- and y-coordinates represent the latitude and longitude, respectively.
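To make the distance computation concrete, the spherical law of cosines with R = 6,371 km can be written as the short Python function below. This is a minimal sketch: the function name is hypothetical, the inputs are assumed to be decimal degrees, and the clamping step is a numerical safeguard added here rather than part of the formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # value of R stated in the text


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines.

    Latitudes and longitudes are assumed to be in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Rounding can push the value slightly outside [-1, 1], where acos fails.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_KM * math.acos(cos_angle)
```

As a sanity check, one degree of longitude along the equator corresponds to roughly 111.19 km under this radius.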

While the estimated distance (double solid arrow in Figure 2) may differ from the actual
distance (single solid arrow in Figure 2) between Users $u$ and $v$ at time $t_1$, it is worth noting
that people tend to send/receive multiple consecutive tweets from the same location to convey
a series of ideas [11]. To validate this user mobility argument, we turn our attention to
analyzing the distribution of the number of tweets (i.e., the tweet frequency) with respect to
user velocity.

In our experiments, we use the same dataset collected from the Twitter users as described in
Section II but focus on the two populous metropolitan areas, Los Angeles and London. To
exclude certain inaccurate location information that may exist due to users’ movements, we take
into account only the case where two consecutive geo-tagged tweet events occur within

11 When Sinnott published the haversine formula [42], computational precision was limited. Nowadays, JavaScript (and
most modern computers and languages) uses IEEE 754 64-bit floating-point numbers, which provide 15 significant digits of
precision. With this precision, the simple spherical law of cosines formula gives well-conditioned results down to distances
as small as around 1 meter. In view of this, it is probably worth, in most situations, using the simpler law of cosines in
preference to the haversine formula.

12 http://mathworld.wolfram.com/SphericalTrigonometry.html
Fig. 3. Probability distribution of the tweet frequency with respect to user velocity (log-linear plot).


one hour. When the location history for two consecutive geo-tagged tweets of User $k$ at time
slots $t_i$ and $t_{i+1}$ is expressed as sequences $(x_{ki}, y_{ki}, t_i)$ and $(x_{k(i+1)}, y_{k(i+1)}, t_{i+1})$, respectively,
the average velocity $v_k^{(i)}$ of the user within this time interval is given by $v_k^{(i)} = \delta_k^{(i)} / (t_{i+1} - t_i)$,
where $\delta_k^{(i)}$ is the distance that User $k$ moved during the interval $[t_i, t_{i+1}]$ and thus is given
by $\delta_k^{(i)} = R \cos^{-1}\left(\sin x_{ki} \sin x_{k(i+1)} + \cos x_{ki} \cos x_{k(i+1)} \cos(y_{k(i+1)} - y_{ki})\right)$ (refer to equation
(1) for more details). From the set of average velocities $\{v_k^{(0)}, v_k^{(1)}, \ldots\}$ obtained from
all users in the dataset, the tweet frequency can be categorized according to the user velocity.
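The velocity computation described above can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the code behind the reported results: timestamps are assumed to be in seconds, coordinates in decimal degrees, and the function names and the `max_gap_s` parameter are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, as in equation (1); inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2)
                 * math.cos(math.radians(lon2 - lon1)))
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def average_velocities_kmh(history, max_gap_s=3600.0):
    """Average velocity between consecutive geo-tagged tweets of one user.

    history: time-ordered list of (lat, lon, t) with t in seconds.
    Pairs separated by one hour or more (or by a non-positive gap) are
    skipped, following the one-hour restriction in the text.
    """
    velocities = []
    for (lat0, lon0, t0), (lat1, lon1, t1) in zip(history, history[1:]):
        gap = t1 - t0
        if gap <= 0 or gap >= max_gap_s:
            continue
        moved_km = great_circle_km(lat0, lon0, lat1, lon1)
        velocities.append(moved_km / (gap / 3600.0))  # km per hour
    return velocities
```

Binning the returned values over all users would give a tweet-frequency histogram with respect to velocity.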

Figure 3 shows the log-linear plot of the distribution of the number of tweets (i.e., the
tweet frequency) versus the user velocity [km/h], which is obtained from empirical data. As
illustrated in Figure 3, most of the Twitter users (approximately 90%) in the two
metropolitan areas are likely to post consecutive tweets in the static mode, whose average velocity
ranges from 0 to 2 km/h. Our experiments also demonstrate that Twitter users in large-scale
geographic areas (e.g., state scale (California) or country scale (the UK)) are more likely to
post consecutive tweets in the static mode than city-scale users, even though these results are not
presented in Figure 3. Although the inter-tweet interval may show a different pattern from
that of the inter-mention interval (i.e., the time duration between a mention and its replied
mention from another user), we believe that the above results are sufficient to support our
analysis methodology.

Now, we are ready to present our distance estimation algorithm (Algorithm 2). The
overall procedure of the proposed algorithm is described in TABLE IV, where $d_{uv}$
denotes the estimated geographic distance between user pair $u \in \{u_0, u_1, \ldots, u_{I-1}\}$ and $v \in
\{v_0, v_1, \ldots, v_{J-1}\}$, and $I$ and $J$ are the total number of senders and receivers in a dataset,
respectively. Note that as shown in lines 14-18 of the table, the estimated distance for one
pair is obtained by taking the average of all distance values computed over the available
inter-mention intervals, each of which is less than one hour.
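The two-stage estimation and the final averaging step can be sketched in Python as follows. This is a simplified illustration rather than the exact algorithm: the tuple layout, the function names, and the use of seconds for timestamps are assumptions made for the example, and the quadratic scan is kept only for readability.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
ONE_HOUR_S = 3600.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, as in equation (1); inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2)
                 * math.cos(math.radians(lon2 - lon1)))
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def estimate_distances(mentions):
    """Average estimated distance per (sender, receiver) pair.

    mentions: time-ordered list of (sender, receiver, lat, lon, t), where
    (lat, lon) is the sender's location and t is in seconds.
    """
    samples = defaultdict(list)
    for i, (u, v, lat_u, lon_u, t0) in enumerate(mentions):
        # Stage 1: the location of u is known from this mention.
        for s, r, lat_v, lon_v, t1 in mentions[i + 1:]:
            if s == v and r == u:
                # Stage 2: the next reply from v reveals the location of v.
                if t1 - t0 < ONE_HOUR_S:
                    samples[(u, v)].append(
                        great_circle_km(lat_u, lon_u, lat_v, lon_v))
                break  # only the reply at the next closest time is used
    # Average over all usable inter-mention intervals of each pair.
    return {pair: sum(ds) / len(ds) for pair, ds in samples.items()}
```

Because the mentions are time-ordered, stopping at the first reply loses nothing: any later reply to the same mention would have an even longer inter-mention interval.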

IV. Analysis Results 

In this section, we first verify whether a Zipf’s power-law holds for the Twitter network 
along with the definition of bidirectional friendship. Next, we show a newly-discovered 
distribution of the number of friends with respect to the geographic distance and then identify 
the two fundamentally separated regimes in the distribution.

Algorithm 2 Distance estimation algorithm

Input: $u_{Tx}^{(t)}$ and $v_{Rx}^{(t)}$ for $t = 0, 1, \ldots, T-1$, $u \in \{u_0, u_1, \ldots, u_{I-1}\}$ and $v \in \{v_0, v_1, \ldots, v_{J-1}\}$
Output: $d_{uv}$ for all $u$ and $v$
Initialization: $c_{uv} \leftarrow 0$ and $d_{uv} \leftarrow 0$ for all $u$ and $v$

00  for $t \leftarrow 0$ to $T-1$ do
01      Find the user indices $u$ and $v$ for $u_{Tx}^{(t)}$ and $v_{Rx}^{(t)}$, respectively
02      for $s \leftarrow t+1$ to $T-1$ do
03          if ($u_{Tx}^{(t)} == v_{Rx}^{(s)}$) then
04              if ($u_{Tx}^{(s)} == v_{Rx}^{(t)}$) then
05                  if (time interval between $t$ and $s$ < 1 hour) then
06                      Compute $d_{uv}^{(c_{uv})}$ in equation (1)
07                      $c_{uv} \leftarrow c_{uv} + 1$
08                      break (go back to line 00)
09                  end if
10              end if
11          end if
12      end for
13  end for
14  for all $u$ and $v$ do
15      for $l \leftarrow 0$ to $c_{uv}$ do
16          $d_{uv} \leftarrow d_{uv} + d_{uv}^{(l)} / c_{uv}$
17      end for
18  end for

A. Number of Friends of a Particular User 

We first find that the probability distribution $P_N(N = n)$ of the number of friends for an
individual, denoted by $n$, on Twitter fits a single power-law function $P_N(N = n) \sim n^{-\alpha}$
for $\alpha > 0$. Figure 4 shows the log-log plot of the distribution $P_N(N = n)$ obtained from
empirical data, logarithmically binned data, and the fitting function, where the fitting is applied
to the binned data. As depicted in the figure, statistical noise exists in the tail where the
number of friends is very large. Such noise can be eliminated by applying logarithmic binning,
which averages out the data that fall in specific bins [43].$^{13}$ We use the traditional least
squares estimation to obtain the fitting function. In TABLE V, the value of the exponent of
$P_N(N = n)$, $\alpha$, is summarized for each region. From Figure 4 and TABLE V, the following
interesting comparisons are made according to types of regions:

• Comparison between the city-scale and state-scale/country-scale results: Figures 4(a)
and 4(b) illustrate that the exponent $\alpha$ is 3.48 and 2.29 in California and Los Angeles,
respectively, which implies that Twitter users in populous metropolitan areas are more

13 It is also verified that this binning procedure does not fundamentally change the underlying power-law exponent of the
distribution $P_N(N = n)$.











Fig. 4. Probability distribution $P_N(N = n)$ of the number of friends of a particular user (log-log plot). Panels: (a) California; (b) Los Angeles.

TABLE V

The value of $\alpha$ for each region.

Region         $\alpha$
California     3.48
Los Angeles    2.29
UK             2.54
London         2.01

likely to contact a higher number of friends within a given period (e.g., one month).
From Figures 4(c) and 4(d), the same trend is also observed by comparing the results for
the UK and London, with $\alpha$ values of 2.54 and 2.01, respectively. That is, urban people
are likely to bilaterally interact with more friends by sending and receiving directed
geo-tagged mentions, compared on average to people in larger regions that include local
small towns.

• Comparison between the results in the two cities (Los Angeles and London): From
Figures 4(b) and 4(d), one can see that the exponent $\alpha$ is 2.29 and 2.01 in Los Angeles and
London, respectively. This reveals that Twitter users in London tend to contact a slightly
higher number of friends within a given period, compared to users in Los Angeles. There
may be many explanations for this phenomenon, including that i) London is one of the
world’s most famous tourist destinations, which would attract relatively more visitors
to use Twitter to send/receive direct mentions to/from their friends in the city and ii)
London has a relatively higher population density than that of Los Angeles (refer to
TABLE I for more details).
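
The fitting procedure used in this subsection (logarithmic binning followed by least squares estimation on the binned data) can be illustrated with the following Python sketch. It is a minimal example under stated assumptions rather than the code used for the reported results: the function names `log_binned_density` and `fit_power_law`, the parameter `bins_per_decade`, and the choice of geometric bin centers are hypothetical, and the least squares line is fitted in log-log space with NumPy.

```python
import numpy as np


def log_binned_density(samples, bins_per_decade=5):
    """Histogram positive samples into logarithmically spaced bins.

    Returns the geometric bin centers and the normalized density of the
    non-empty bins (empty bins cannot be shown on a log-log plot).
    """
    samples = np.asarray(samples, dtype=float)
    lo, hi = samples.min(), samples.max()
    n_bins = max(1, int(np.ceil(np.log10(hi / lo) * bins_per_decade)))
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    edges[0], edges[-1] = lo, hi  # guard against floating-point rounding
    counts, edges = np.histogram(samples, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])
    density = counts / (widths * samples.size)
    keep = counts > 0
    return centers[keep], density[keep]


def fit_power_law(x, y):
    """Least squares line in log-log space; returns (alpha, intercept)
    such that y is approximately 10**intercept * x**(-alpha)."""
    slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
    return -slope, intercept
```

Dividing each count by its bin width is what keeps the slope of the binned data equal to the exponent of the underlying density; without that normalization the fitted exponent would be shifted by one.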


B. Number of Friends With Respect to Distance 

The most interesting characteristic of friendship degrees is how the friends of a user are
distributed with respect to the geographic distance between the Twitter user and his/her friend.
In this subsection, similarly as in [10], we also verify whether Twitter users establish more
relationships with friends who live in geographic proximity to them. As mentioned
before, in our experiments, we use geo-tagged mentions to identify the location information
of a user when he/she sent a mention to his/her friend. To detect his/her friend’s location, we
then observe replied geo-tagged mentions that were sent at the next closest time. Using these
bidirectional mentions, we characterize the probability distribution $P_D(D = d)$ of the number
of friends according to the distance $d$, where $d$ [km] is the geographic distance between a
user and his/her friend.

Unlike the earlier work in [0], the heterogeneous shape of $P_D(D = d)$ for the entire interval
cannot be captured by a single commonly-used statistical function such as a homogeneous
power-law using the approach of parametric fitting. Interestingly, as our main result, we
observe that for the distance $d \in [d_{\min}, d_{\max}]$, $P_D(D = d)$ can be described as a double
power-law distribution, which is given as:

\[
P_D(D = d) \sim
\begin{cases}
d^{-\gamma_1} & \text{if } d_{\min} \le d < d_s \quad \text{(intra-city regime)} \\
d^{-\gamma_2} & \text{if } d_s \le d \le d_{\max} \quad \text{(inter-city regime),}
\end{cases}
\tag{2}
\]

where $\gamma_1$ and $\gamma_2$ denote the exponents for each individual power-law and $d_s$ is the separation
point. This finding indicates that the friendship degree can be composed of two
separate regimes characterized by two different power-laws, termed the intra-city and inter-city
regimes. Figure 5 shows the log-log plot of the distribution $P_D(D = d)$ from empirical
data, logarithmically binned data, and the fitting function, where the fitting is applied to the
binned data. As in Section IV-A, we also use the traditional least squares estimation to obtain
the fitting function.$^{14}$ In TABLE VI, the value of the exponents of $P_D(D = d)$, $\gamma_1$ and $\gamma_2$, is
summarized for each region.
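
One simple way to obtain such a two-regime least squares fit is sketched below in Python. The text does not spell out how the separation point is selected, so the exhaustive scan over candidate breakpoints is an assumption of this example, as are the function name `fit_double_power_law` and the parameter `min_points`; the two segments are fitted independently in log-log space and are not forced to meet at the separation point.

```python
import numpy as np


def fit_double_power_law(d, p, min_points=3):
    """Fit two power-law segments to binned data by scanning the breakpoint.

    d, p: binned distances and probabilities (positive, d ascending).
    Returns (gamma1, gamma2, d_s), where d_s is the first distance of the
    second segment for the split with the smallest total squared residual.
    """
    d = np.asarray(d, dtype=float)
    x, y = np.log10(d), np.log10(np.asarray(p, dtype=float))
    best = None
    for k in range(min_points, len(x) - min_points + 1):
        left = np.polyfit(x[:k], y[:k], 1)    # intra-city candidate
        right = np.polyfit(x[k:], y[k:], 1)   # inter-city candidate
        sse = (np.sum((np.polyval(left, x[:k]) - y[:k]) ** 2)
               + np.sum((np.polyval(right, x[k:]) - y[k:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, -left[0], -right[0], d[k])
    if best is None:
        raise ValueError("need at least 2 * min_points data points")
    _, gamma1, gamma2, d_s = best
    return gamma1, gamma2, d_s
```

Because the scan only considers the binned distances as breakpoints, the returned separation point has the resolution of the binning; a finer estimate would require a continuous optimization over $d_s$.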

Unlike the earlier studies in [5], [0] that do not capture the friendship patterns in the intra-city
regime, our analysis exhibits two distinguishable features with respect to distance. More
specifically, in each regime, the following interesting observations are made:

• In the intra-city regime, the distribution $P_D(D = d)$ decays slowly with distance $d$,
which means that geographic proximity weakly affects the number of intra-city friends
with which one user interacts. That is, in this regime, the geographic distance is less
relevant for determining the number of friends. This finding reveals that more active
Twitter users tend to preferentially interact over short-distance connections.

• In the inter-city regime, $P_D(D = d)$ depends strongly on the geographic distance, where
there exists a sharp transition in the distribution $P_D(D = d)$ beyond the separation point
$d_s$. Thus, long-distance communication is made occasionally.


14 Using maximum likelihood estimation to fit a mixture function (e.g., a double power-law function) is not easy to
implement and the performance of mixture functions has not been well understood.

Fig. 5. Probability distribution $P_D(D = d)$ of the number of friends with respect to distance (log-log plot). Panel: (b) Los Angeles.


TABLE VI

The value of $\gamma_1$ and $\gamma_2$ for each region.

Region         $\gamma_1$    $\gamma_2$
California     0.60          1.39
Los Angeles    0.60          6.23
UK             0.69          1.47
London         0.38          7.13

The above argument stems from the fact that the separation point $d_s$ is closely related
to the length and width of the city in which a user resides. From these observations, we
may conclude that within a given period, an individual is much more likely to contact
friends online who are in location-based communities that range from the local neighborhood,
suburb, village, or town up to the city level. In addition, the following interesting comparisons
are made according to types of regions:

• Comparison between the city-scale and state-scale/country-scale results: We observe
that the separation point $d_s$ in populous metropolitan areas is much greater than that
in larger regions that include local small towns (such as at the state or country level).
For example, from Figures 5(a) and 5(b), we see that $d_s$ is approximately 8 km and
22 km in California and Los Angeles, respectively. From Figures 5(c) and 5(d), the
same trend is observed by comparing the results for the UK and London (18 km and
21 km, respectively). This finding reveals that Twitter users in populous metropolitan
areas (e.g., Los Angeles and London) have a stronger tendency to contact friends on
Twitter who are geographically away from their location (i.e., interacting over long-distance
connections). This is because the average size (referred to as the land area) of
the considered metropolitan cities is relatively larger than that of cities in larger regions
including small towns. Furthermore, it is seen that the exponent in the inter-city regime
(i.e., $\gamma_2$) in metropolitan areas is significantly higher than that in larger regions. Unlike
the state-scale/country-scale results, this finding implies that the distribution $P_D(D = d)$
sharply drops off beyond $d_s$ in huge metropolitan areas.

• Comparison between the results in the two cities (Los Angeles and London): From
Figures 5(b) and 5(d), one can see that $\gamma_1$ is 0.60 and 0.38 and $\gamma_2$ is 6.23 and 7.13
in Los Angeles and London, respectively. Thus, in the intra-city regime, the geographic
distance is less relevant in London for determining the number of friends. However, in
the inter-city regime, the distribution $P_D(D = d)$ in London shows a slightly steeper decline.

Our geo-tagged Twitter data provides position resolution at up to 10 meters, compared to
the typical city-scale resolution in previous studies on friendship [0], thus allowing much
more fine-grained validation of these heterogeneous behaviors in terms of distance.


V. Concluding Remarks 

The present work has developed a novel framework for analyzing the degree of bidirectional
online friendship via Twitter, not only utilizing geo-tagged mentions but also introducing
a definition of bidirectional friendship. To provide analysis results, we first introduced two new
algorithms, the first for counting friends and the second for two-stage distance estimation.
We verified that the homogeneous power-law model, also known as Zipf’s law,
holds on Twitter in terms of the number of friends of one user. More interestingly, we
comprehensively demonstrated that the number of friends according to geographic distance follows a
double power-law distribution, or equivalently, a double Pareto law distribution, where there
exists a strict separation point in distance that distinguishes the intra-city regime from the
inter-city regime. Our analysis sheds light on a new understanding of social interaction/relationships
online with regard to small-scale space as well as large-scale space.

Characterization of the degree of friendship in space along with a greater variety of
city/state/country-scale data on Twitter remains for future work. Suggestions for further
research in this area also include analyzing friendship in the temporal domain (time) by
utilizing geo-located Twitter data.